Blocking Blog Spam with Language Model Disagreement
Abstract
We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity. Preliminary experiments with identification of typical blog spam show promising results.
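The abstract does not say how the language models are built or compared, so the following is only a minimal sketch of one way "disagreement" between a post and a comment could be scored. It assumes whitespace-tokenized unigram models with additive smoothing, compared by Kullback-Leibler divergence. The function names (`unigram_model`, `kl_divergence`, `disagreement`), the smoothing constant, and the sample texts are all illustrative and do not come from the source.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Additive-smoothed unigram distribution over a fixed vocabulary.

    Assumption: lowercase whitespace tokenization and smoothing constant
    alpha are illustrative choices, not taken from the paper.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def disagreement(post, comment):
    """Score how far the comment's word distribution is from the post's.

    Both models share the union vocabulary, so smoothing keeps every
    probability strictly positive and the divergence finite.
    """
    vocab = set(post.lower().split()) | set(comment.lower().split())
    p = unigram_model(comment, vocab)
    q = unigram_model(post, vocab)
    return kl_divergence(p, q)


# Hypothetical sample texts, for illustration only.
post = "we compare language models to detect spam in blog comments"
on_topic = "comparing language models of blog comments sounds useful"
spam = "cheap pills casino poker buy cheap pills online"

print(disagreement(post, on_topic))
print(disagreement(post, spam))
```

In this sketch a higher score means the comment's vocabulary diverges more from the post's. The text of a page linked from the comment could be scored against the post in the same way. Any cutoff for flagging a comment as spam would have to be chosen separately, since the abstract gives none.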
Similar resources
Online Spam Detection in Blogs: A Behavior-based Approach
With the increasing usage of user generated content based social networks, spam content is surging by taking advantage of the convenience of web posting. Modern spammers in social networks insert popular keywords or even copy and paste recent articles on the web with spam links inserted, in order to evade language model based spam detection. In this paper, we first conduct a comprehensive analy...
AIRWeb 2005 Proceedings
We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity. Preliminary experiments with identification of typical blog spam ...
Library blogs and user participation: a survey about comment spam in library blogs
Purpose The purpose of this research is to identify and describe the impact of comment spam in library blogs. Three research questions guided the study: current level of commenting in library blogs; librarians' perception of comment spam; and techniques used to address the comment spam problem. Design/methodology/approach A quantitative approach is used to investigate research questions. Inform...
Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: Proceedings of the Main Conference
Email is the number one activity that people do on the internet: 74% of internet users check their email on an average day. Email use in offices has more than doubled since 2000, and is now over 8 hours a week. There are many great NLP problems for email, like automatic clustering and foldering, search, prioritization, automatically finding keywords within messages, finding addresses, and summa...
Detecting Blog Spams using the Vocabulary Size of All Substrings in Their Copies
This paper addresses the problem of detecting blog spams, which are unsolicited messages on blog sites, among blog entries. Unlike a spam mail, a typical blog spam is produced to increase the PageRank for the spammer’s Web sites, and so many copies of the blog spam are necessary and all of them contain URLs of the sites. Therefore the number of the copies, we call it the frequency, seems to be ...
Publication date: 2005